+++
title = "Deep Dive | Decision Trees"
summary = "Decision trees are a supervised learning algorithm used for classification and regression. This post will dive deep into the workings of a decision tree. Along the way, I will walk through a detailed example and showcase some necessary skills required to tune and improve any machine learning algorithm. The post will also exhibit some exploratory data analysis and visualization practices that can help better understand the data."
date = "2019-08-16T13:39:46+02:00"
description = "An in-depth guide to decision trees using scikit-learn, featuring data visualization, feature engineering, hyperparameter tuning and cross validation"
author = "Arzan Irani"
tags = ["hugo"]
categories = ["pseudo"]
image = "img/blog/deep_dive_decision_trees/title_img.png"
+++
Decision trees are a supervised learning algorithm used for classification and regression.
As the name suggests, a decision tree makes a series of decisions based on each feature in the dataset by following an if-this-then-that logic.
Imagine a hypothetical situation in which there are 60 students in a classroom (30 male, 30 female) and all the girls love playing volleyball, while the boys do not.
In this hypothetical scenario, we could say that 'if the gender is Female, then the person would like volleyball'. Hence the value of the Gender variable would act as a good predictor for whether the person likes volleyball or not. This process of partitioning is called splitting.
A decision tree that would split the data set on the Gender variable would hence be able to make good predictions about whether a random student likes volleyball or not.
import pandas as pd
data = pd.read_csv('./data/volleyball_example.csv')
data.head(5)
The decision-making process of the algorithm is driven by what is called entropy, or equivalently information gain. Entropy is a measure of disorder. Let's say the entropy of our original dataset was 1. After splitting the data on gender, we have created two smaller datasets (in this case one of 30 male and another of 30 female students). These smaller datasets have less disorder in them, because within each of them every student gives the same answer to whether they like volleyball, whereas the original dataset held an even mix of both answers.
By measuring the difference between the entropy of the original dataset and the weighted sum of the entropies of the two smaller datasets (each weighted by its share of the data points), we can measure how much disorder was lost as a result of our splitting decision, or in other words how much order or information was gained.
$Information\ gained = Entropy(Original\ Data) - \left(\frac{30}{60} Entropy(Females) + \frac{30}{60} Entropy(Males)\right)$
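To make the arithmetic concrete, here is a minimal sketch in plain Python of the entropy and information gain calculation for the volleyball example. The helper names `entropy` and `information_gain` are my own, and I am assuming the usual convention of weighting each subset's entropy by its share of the data points.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child subsets."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical classroom: the 30 female students like volleyball, the 30 male students do not
females = ["likes"] * 30
males = ["dislikes"] * 30
everyone = females + males

print(entropy(everyone))                             # 1.0
print(information_gain(everyone, [females, males]))  # 1.0
```

Because each subset is perfectly pure after the split, all of the original disorder is removed and the information gained equals the starting entropy.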
Each decision or splitting step in a decision tree is referred to as a node. At a node, the decision tree would consider every feature in the dataset and try to split the data-set based on a certain threshold of that feature. **The feature that gets selected is the one that provides the most information gain after the split.**
Alternatively, a decision tree can use what is called the gini impurity. The gini impurity is a quantitative measure that calculates the probability of incorrectly classifying a randomly selected data point based on the distribution of the classes.
Going back to our hypothetical volleyball example, let's consider the following: 30 of the students prefer volleyball, while 30 of the students do not. Hence the probability that a randomly selected student likes volleyball is 0.50.
| Case | Selected Student | Prediction | Probability |
|---|---|---|---|
| Case 1 | Likes volleyball | Does not like volleyball | $0.5 * 0.5 = 0.25$ |
| Case 2 | Likes volleyball | Likes volleyball | $0.5 * 0.5 = 0.25$ |
| Case 3 | Does not like volleyball | Likes volleyball | $0.5 * 0.5 = 0.25$ |
| Case 4 | Does not like volleyball | Does not like volleyball | $0.5 * 0.5 = 0.25$ |
As can be seen from the above table, the probability of us incorrectly classifying a randomly selected student is the sum of Case 1 and Case 3 i.e.
$0.25 + 0.25 = 0.5$
Mathematically, the Gini impurity can be expressed as:
$Gini\ Impurity = \sum_{i=1}^C p(i)*(1-p(i))$
where,
C - total number of classes
i - index of each class
p(i) - probability of being that class
1 - p(i) - probability of not being that class
The decision tree makes a split based on the feature and the threshold that result in the lowest Gini impurity after the split, measured as the size-weighted average over the resulting daughter nodes.
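The formula can be checked with a few lines of plain Python. This is only a sketch: `gini_impurity` and `weighted_gini` are names I made up for illustration, and the data is the hypothetical volleyball class from earlier.

```python
from collections import Counter

def gini_impurity(labels):
    """Sum of p(i) * (1 - p(i)) over the classes present in labels."""
    total = len(labels)
    return sum((n / total) * (1 - n / total) for n in Counter(labels).values())

def weighted_gini(children):
    """Size-weighted average impurity of the subsets produced by a split."""
    total = sum(len(child) for child in children)
    return sum(len(child) / total * gini_impurity(child) for child in children)

likes = ["likes"] * 30
dislikes = ["dislikes"] * 30

print(gini_impurity(likes + dislikes))   # 0.5
print(weighted_gini([likes, dislikes]))  # 0.0
```

The unsplit class gives 0.5, matching the table above, while the split on gender drives the impurity to 0.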
As can be expected, decision tree regressors require an alternative measure to quantify the goodness of a split. One method to do so is to monitor the mean squared error and select the feature and threshold that reduces the error the most.
Note: Decision tree regressors do not predict on a continuous range. Instead, the prediction is the mean of the response values left in the leaf node in which you land.
Alternatively, we could also use Friedman's MSE or the mean absolute error (MAE) to help quantitatively decide the best split.
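As a rough illustration of how a regressor could pick a split with mean squared error, the sketch below tries every midpoint between neighbouring values of a single feature and keeps the one with the lowest size-weighted MSE. The function names and the data are made up for this example, and real implementations are considerably more optimized.

```python
def mse(values):
    """Mean squared error when every value is predicted by the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def best_threshold(xs, ys):
    """Return (threshold, weighted MSE) of the best single split on xs."""
    pairs = sorted(zip(xs, ys))
    best = (None, mse(ys))  # baseline: no split at all
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # identical feature values cannot be separated
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for x, y in pairs if x <= threshold]
        right = [y for x, y in pairs if x > threshold]
        weighted = (len(left) * mse(left) + len(right) * mse(right)) / len(pairs)
        if weighted < best[1]:
            best = (threshold, weighted)
    return best

# Made-up data: the response jumps once x passes 3
xs = [1, 2, 3, 4, 5, 6]
ys = [10, 11, 10, 20, 21, 19]
threshold, error = best_threshold(xs, ys)
print(threshold)  # 3.5
```

Each side of the chosen threshold would then predict the mean of its own response values, which is why the predictions of a tree regressor come in steps rather than on a continuous range.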
Now that we understand the mechanism that decision tree algorithms use to identify the best split, let us understand the hyperparameters that can be tuned to control the way the algorithm learns from the data. Tuning hyperparameters is an important step of any machine learning algorithm.
A full list of hyperparameters can be found in the sklearn documentation:
Decision Tree Regressors: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html
Decision Tree Classifiers: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
We'll focus on a few hyperparameters that I consider important in my experience.
The criterion is the quantitative measure that can be used by the algorithm to make the decision of the best split.
For decision tree classifiers, the options are gini impurity or entropy. There doesn't seem to be a significant advantage in picking one over the other, so either should work in practice.
Decision tree regressors, on the other hand, could use mean squared error, Friedman mean squared error, or mean absolute error. The difference between mean squared error and mean absolute error is that the former is minimized by predicting the mean of the samples in a node, while the latter is minimized by predicting the median. The Friedman mean squared error is a variant of mean squared error that uses Friedman's improvement score, which takes into account the number of samples in the left and right daughters of a node, to judge the quality of a split.
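The mean versus median distinction is easy to verify numerically. The following sketch uses a made-up leaf with one outlier and two helper functions of my own naming.

```python
from statistics import mean, median

def total_squared_error(values, prediction):
    return sum((v - prediction) ** 2 for v in values)

def total_absolute_error(values, prediction):
    return sum(abs(v - prediction) for v in values)

# Made-up response values in a single leaf, including one outlier
leaf = [1, 2, 3, 4, 100]
leaf_mean, leaf_median = mean(leaf), median(leaf)

# The mean gives the smaller squared error, the median the smaller absolute error
print(total_squared_error(leaf, leaf_mean), total_squared_error(leaf, leaf_median))
print(total_absolute_error(leaf, leaf_mean), total_absolute_error(leaf, leaf_median))
```

This also hints at why mean absolute error tends to be less sensitive to outliers: the median barely moves when the value 100 is added to the leaf.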
Each split in a node of a decision tree accounts for an additional layer of depth. The first node in a decision tree is referred to as the root node. Splitting the root node results in two daughter nodes, each with a smaller subset of the data, and in turn an increase in the depth of the tree. (A tree that stops after this single split is known as a decision stump.) The max_depth hyperparameter allows you to dictate the maximum depth that a decision tree may attain before the algorithm must stop.
How to use this?
The greater the depth of the tree, the more complex the algorithm's learning path. By setting a tree with a larger depth, the algorithm can more precisely fit the learning data. As with all hyperparameters, there isn't a golden rule that fits all situations. However, a general rule of thumb is that the deeper the tree, the more likely it is to over-fit the training data, and conversely, the shallower the tree, the more likely it is to under-fit the training data.
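To see the effect of depth on the training fit, here is a toy tree builder for a single numeric feature, written from scratch for illustration. It is not how sklearn implements trees, and the data (including the deliberately noisy point at x = 3) is invented.

```python
from collections import Counter

def gini(labels):
    total = len(labels)
    return sum((n / total) * (1 - n / total) for n in Counter(labels).values())

def build(points, max_depth, depth=0):
    """Grow a tree on (x, label) pairs; returns a dict (split) or a label (leaf)."""
    labels = [label for _, label in points]
    majority = Counter(labels).most_common(1)[0][0]
    xs = sorted({x for x, _ in points})
    # Stop when the depth limit is hit, the node is pure, or nothing can be split
    if depth == max_depth or len(set(labels)) == 1 or len(xs) < 2:
        return majority
    best = None
    for low, high in zip(xs, xs[1:]):
        threshold = (low + high) / 2
        left = [p for p in points if p[0] <= threshold]
        right = [p for p in points if p[0] > threshold]
        score = sum(len(side) * gini([label for _, label in side])
                    for side in (left, right)) / len(points)
        if best is None or score < best[0]:
            best = (score, threshold, left, right)
    _, threshold, left, right = best
    return {"threshold": threshold,
            "left": build(left, max_depth, depth + 1),
            "right": build(right, max_depth, depth + 1)}

def predict(node, x):
    while isinstance(node, dict):
        node = node["left"] if x <= node["threshold"] else node["right"]
    return node

def accuracy(node, points):
    return sum(predict(node, x) == label for x, label in points) / len(points)

# Made-up 1-D training data; the point at x = 3 plays the role of noise
train = [(1, 0), (2, 0), (3, 1), (4, 0), (5, 1), (6, 1), (7, 1), (8, 1)]
scores = {d: accuracy(build(train, max_depth=d), train) for d in (1, 2, 3)}
print(scores)
```

With a depth of 1 the tree gets 7 of the 8 training points right; by depth 3 it has memorized the noisy point as well and reaches a perfect training score, which is exactly the kind of fit that may not carry over to unseen data.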
As we've understood by now, by traversing down a tree we will encounter nodes splitting into daughter nodes and each daughter node would contain a smaller subset of the data to be further split based on a certain feature and threshold. It isn't necessary that each pair of sister nodes (daughter nodes originating from the same parent node) have the same number of data points in them. In fact, more often than not, sister nodes would not have the same number of data points in them. The min_samples_split hyperparameter allows you to govern which node can be further split and which cannot based on the number of data points left in them.
How to use this?
Just as with max_depth, the min_samples_split hyperparameter also helps govern how complex the tree is, or how strongly the algorithm fits the training data. Having a really small number for min_samples_split would allow the algorithm to dissect the data-set until a very small number of data points remain in each node - provided the maximum allowable depth of the tree (max_depth) hasn't been reached. The smaller min_samples_split is, the more complex the learning path of your algorithm and the higher its tendency to over-fit the training data. Conversely, the higher the value, the more likely it is to under-fit the training data.
Before we understand this hyperparameter, it is important to recollect the difference between a leaf node and an internal node. Contrary to an internal node, a leaf node is one that does not have any daughter nodes. It marks the end of the splitting of the data in that branch of the decision tree. The min_samples_leaf hyperparameter allows us to control how many samples must be left in a daughter of an internal node in order for the split to be allowed. The purpose of min_samples_leaf can often be hard to differentiate from min_samples_split, so let's consider an example: Let's say an internal node has 10 data points, and we've set min_samples_split to 7 and min_samples_leaf to 5. Since there are more than 7 data points, we pass the min_samples_split requirement. However, if the split results in a daughter node having fewer than 5 data points, then it would violate the min_samples_leaf threshold and the split would not be allowed.
How to use this?
The primary difference in effect between min_samples_split and min_samples_leaf is that the former governs whether a node can be split, while the latter governs whether the resulting split creates nodes that are viable. By setting min_samples_leaf to a really small number, we allow the algorithm to learn the training data more precisely, which may result in over-fitting. Another way to understand this is that leaves with a smaller number of data points allow the algorithm to dissect the data into smaller groups that would more likely be unique to the training data, and might not be reflective of the real world or unseen data.
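The interplay of the two hyperparameters can be written down as a small check. The function `split_allowed` below is a hypothetical helper of mine that mirrors the rule described above in simplified form; its defaults of 2 and 1 match the sklearn defaults for min_samples_split and min_samples_leaf.

```python
def split_allowed(n_node, n_left, n_right, min_samples_split=2, min_samples_leaf=1):
    """Return True if a node with n_node samples may be split into n_left / n_right."""
    if n_node < min_samples_split:
        return False  # the node itself is too small to be split
    # Both daughters must keep at least min_samples_leaf samples
    return n_left >= min_samples_leaf and n_right >= min_samples_leaf

# The example from the text: 10 data points, min_samples_split=7, min_samples_leaf=5
print(split_allowed(10, 5, 5, min_samples_split=7, min_samples_leaf=5))  # True
print(split_allowed(10, 6, 4, min_samples_split=7, min_samples_leaf=5))  # False
# A node with only 6 data points fails the min_samples_split check outright
print(split_allowed(6, 3, 3, min_samples_split=7, min_samples_leaf=1))   # False
```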
Decision trees can accept categorical and numerical data. However, the programming language and packages you use will govern the format in which you can pass in categorical variables. Since we are using scikit-learn in this example, I'll explain the procedure for the same.
Scikit-learn's decision tree algorithm requires categorical variables to be encoded into numbers. Hence, a variable like sex that could have values of 'male' and 'female' would need to be transformed to '1' and '0'.
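What such an encoding does can be sketched without any library. The helper `fit_label_encoder` is a made-up name; it assigns integer codes to the sorted distinct categories, which is also the order sklearn's LabelEncoder uses.

```python
def fit_label_encoder(values):
    """Map each distinct category to an integer code, in sorted order."""
    return {category: code for code, category in enumerate(sorted(set(values)))}

sex = ["male", "female", "female", "male"]
mapping = fit_label_encoder(sex)
encoded = [mapping[value] for value in sex]

print(mapping)  # {'female': 0, 'male': 1}
print(encoded)  # [1, 0, 0, 1]
```

For a two-valued variable like this the choice of codes is harmless. With more than two categories, integer codes imply an ordering that may not exist in the data, which is worth keeping in mind when reading the splits of the resulting tree.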
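Older versions of scikit-learn's decision tree algorithm, including the one available when this post was written, throw an error if you pass them any features with missing values (native support for missing values in decision trees only arrived in scikit-learn 1.3). Hence, it is imperative to deal with missing values before feeding the data to the algorithm.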
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import re
from sklearn import tree
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold, GridSearchCV, RandomizedSearchCV, cross_val_score, cross_validate
from sklearn.model_selection import train_test_split
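from sklearn.preprocessing import LabelEncoder
# Imputer was removed from sklearn.preprocessing in scikit-learn 0.22; SimpleImputer replaces it
from sklearn.impute import SimpleImputer as Imputer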
from IPython.display import Image as PImage
from subprocess import check_call
from PIL import Image, ImageDraw, ImageFont
%matplotlib inline
# Use pandas' built-in function to read in the learning and testing data
learning = pd.read_csv('./data/titanic/train.csv')
test = pd.read_csv('./data/titanic/test.csv')
# Make a list of both data sets to conveniently apply data wrangling to all the data
full_data = [learning, test]
learning.head(5)
PassengerId: A serial number for the passengers
Pclass: The class of ticket the passenger purchased (1 = 1st, 2 = 2nd, 3 = 3rd)
Name: The name of each passenger
Sex: Whether the passenger was male or female
Age: The age of the passenger
SibSp: The number of passengers who were either siblings or a spouse of the passenger in question
Parch: The number of passengers who were either parents or children of the passenger in question
Ticket: The ticket number of the ticket purchased by the passenger
Fare: The price paid for the ticket
Cabin: The cabin number allocated to the passenger. If NaN, it implies the passenger did not have a cabin
Embarked: The port that the passenger embarked from
Survived: Whether the passenger survived the sinking of the Titanic
# Get the data types of each variable as a dataframe
learning.dtypes.to_frame(name = 'data_type')
# Get summary statistics of all numerical variables
learning.describe()
# Get descriptive statistics of non-numerical variables
learning.describe(exclude = ['int64', 'float64'])
In spite of having no missing values, there seem to be only 681 unique ticket numbers. Let's explore to understand why each passenger didn't have a unique ticket number.
learning['Ticket'].value_counts().to_frame(name = 'Count').head(5)
# Let's explore a few repeating ticket numbers
learning[learning['Ticket'] == "347082"]
learning[learning['Ticket'] == "CA. 2343"]
It appears that repeated ticket numbers signify families traveling together. We can deduce this from the common last names of the family.
Since the NaN values in the cabin variable signify that the passenger was without a cabin, we can adjust the column to a boolean variable. Furthermore, sklearn's decision tree algorithm requires categorical variables to be encoded into numbers.
# Change the Cabin variable to a boolean in both test and learning dataframes
for data in full_data:
    # 1 if a cabin number is recorded, 0 if it is missing (NaN)
    data['Cabin'] = data['Cabin'].notna().astype(int)
learning.head(5)
Let's visualize the variables to have a better understanding of their distribution.
learning['Sex'].value_counts().plot.bar(alpha = 0.8)
plt.title('Distribution of Sex')
plt.xlabel('Sex')
plt.ylabel('Frequency')
plt.show()
learning['Cabin'].value_counts().plot.bar(alpha = 0.8)
plt.title('Distribution of Passengers with and without Cabins')
plt.xlabel('Cabin')
plt.ylabel('Frequency')
plt.show()
learning['Pclass'].value_counts().plot.bar(alpha = 0.8)
plt.title('Distribution of Passenger Class')
plt.xlabel('Passenger Class')
plt.ylabel('Frequency')
plt.show()
learning['Embarked'].value_counts().plot.bar(alpha = 0.8)
plt.title('Distribution of port of Embarkation')
plt.xlabel('Port of Embarkation')
plt.ylabel('Frequency')
plt.show()
learning['Age'].hist(bins = 20, alpha = 0.8)
plt.title('Distribution of Age')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()
learning['Parch'].hist(bins = 6, alpha = 0.8)
plt.title('Distribution of the number of passengers who were either parents or children of the passenger in question')
plt.xlabel('Number of Parents and/or children')
plt.ylabel('Frequency')
plt.show()
learning['SibSp'].hist(bins = 8, alpha = 0.8)
plt.title('Distribution of the number of passengers who were either siblings or a spouse of the passenger in question')
plt.xlabel('Number of Spouse and Siblings')
plt.ylabel('Frequency')
plt.show()
learning['Fare'].hist(bins = 10, alpha = 0.8)
plt.title('Distribution of Fare')
plt.xlabel('Fare')
plt.ylabel('Frequency')
plt.show()
It is a good idea to see how a feature's distribution varies based on the classes of our categorical response variable. Features that show a good amount of disparity could serve as strong predictors.
import warnings
warnings.filterwarnings('ignore')
sns.distplot(learning[learning['Survived'] == 1][['Age']].dropna(),
kde = False, color = 'red', label = 'Survived').set_title('Distribution of Age')
sns.distplot(learning[learning['Survived'] == 0][['Age']].dropna(),
kde = False, color = 'grey', label = 'Did not Survive').set_title('Distribution of Age')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.legend()
plt.show()
sns.distplot(learning[learning['Survived'] == 1][['SibSp']].dropna(), bins =5,
kde = False, color = 'red', label = 'Survived')
sns.distplot(learning[learning['Survived'] == 0][['SibSp']].dropna(), bins =5,
kde = False, color = 'grey', label = 'Did not Survive')
plt.xlabel('Sum of number of Siblings and Spouse')
plt.ylabel('Frequency')
plt.title('Number of passengers that survived based on the number of siblings and spouse they had')
plt.legend()
plt.show()
sns.distplot(learning[learning['Survived'] == 1][['Parch']].dropna(), bins =8,
kde = False, color = 'red', label = 'Survived')
sns.distplot(learning[learning['Survived'] == 0][['Parch']].dropna(), bins =8,
kde = False, color = 'grey', label = 'Did not Survive')
plt.xlabel('Sum of number of Parents and Children')
plt.ylabel('Frequency')
plt.title('Number of passengers that survived based on the number of parents and children they had')
plt.legend()
plt.show()
sns.distplot(learning[learning['Survived'] == 1][['Fare']].dropna(), bins = 10,
kde = False, color = 'red', label = 'Survived')
sns.distplot(learning[learning['Survived'] == 0][['Fare']].dropna(), bins = 10,
kde = False, color = 'grey', label = 'Did not Survive')
plt.xlabel('Fare paid')
plt.ylabel('Frequency')
plt.title('Number of passengers that survived based on the fare they paid')
plt.legend()
plt.show()
# Get summary counts for each port of embarkation for passengers that survived
embarked_summary_survived = learning[learning['Survived'] == 1]['Embarked'].dropna().value_counts().to_frame(name='Count')
# Get summary counts for each port of embarkation for passengers that did not survive
embarked_summary_did_not_survive = learning[learning['Survived'] == 0]['Embarked'].dropna().value_counts().to_frame(name='Count')
# Label each frame explicitly instead of relying on the row order after stacking
embarked_summary_survived['Survived'] = 'Yes'
embarked_summary_did_not_survive['Survived'] = 'No'
# Stack the two data frames into one (DataFrame.append was removed in pandas 2.0)
embarked_summary = pd.concat([embarked_summary_survived, embarked_summary_did_not_survive])
# Move the port of embarkation from the index into a regular column
embarked_summary = embarked_summary.rename_axis('Embarked').reset_index()
# Final Summary
embarked_summary
sns.catplot(x="Embarked", y="Count", hue="Survived", data=embarked_summary,
height=6, kind="bar", palette = ['#FF999A','#CCCCCC'])
plt.title("Counts of passengers that did and did not survive per port of Embarkation")
plt.show()
# Get summary counts of passengers with and without cabins that survived
cabin_summary_survived = learning[learning['Survived'] == 1]['Cabin'].dropna().value_counts().to_frame(name='Count')
# Get summary counts of passengers with and without cabins that did not survive
cabin_summary_did_not_survive = learning[learning['Survived'] == 0]['Cabin'].dropna().value_counts().to_frame(name='Count')
# Label each frame explicitly instead of relying on the row order after stacking
cabin_summary_survived['Survived'] = 'Yes'
cabin_summary_did_not_survive['Survived'] = 'No'
# Stack the two data frames into one (DataFrame.append was removed in pandas 2.0)
cabin_summary = pd.concat([cabin_summary_survived, cabin_summary_did_not_survive])
# Move the cabin indicator from the index into a regular column
cabin_summary = cabin_summary.rename_axis('Cabin').reset_index()
# Final Summary
cabin_summary
sns.catplot(x="Cabin", y="Count", hue="Survived", data=cabin_summary,
height=6, kind="bar", palette = ['#FF999A','#CCCCCC'])
plt.title("Counts of passengers that did and did not survive based on whether they did or did not have a cabin")
plt.show()
# Get summary counts for each sex for passengers that survived
sex_summary_survived = learning[learning['Survived'] == 1]['Sex'].dropna().value_counts().to_frame(name='Count')
# Get summary counts for each sex for passengers that did not survive
sex_summary_did_not_survive = learning[learning['Survived'] == 0]['Sex'].dropna().value_counts().to_frame(name='Count')
# Label each frame explicitly instead of relying on the row order after stacking
sex_summary_survived['Survived'] = 'Yes'
sex_summary_did_not_survive['Survived'] = 'No'
# Stack the two data frames into one (DataFrame.append was removed in pandas 2.0)
sex_summary = pd.concat([sex_summary_survived, sex_summary_did_not_survive])
# Move the sex from the index into a regular column
sex_summary = sex_summary.rename_axis('Sex').reset_index()
# Final Summary
sex_summary
sns.catplot(x="Sex", y="Count", hue="Survived", data=sex_summary,
height=6, kind="bar", palette = ['#FF999A','#CCCCCC'])
plt.title("Counts of passengers that did and did not survive by sex")
plt.show()
# Get summary counts for each passenger class for passengers that survived
pclass_summary_survived = learning[learning['Survived'] == 1]['Pclass'].dropna().value_counts().to_frame(name='Count')
# Get summary counts for each passenger class for passengers that did not survive
pclass_summary_did_not_survive = learning[learning['Survived'] == 0]['Pclass'].dropna().value_counts().to_frame(name='Count')
# Label each frame explicitly instead of relying on the row order after stacking
pclass_summary_survived['Survived'] = 'Yes'
pclass_summary_did_not_survive['Survived'] = 'No'
# Stack the two data frames into one (DataFrame.append was removed in pandas 2.0)
pclass_summary = pd.concat([pclass_summary_survived, pclass_summary_did_not_survive])
# Move the passenger class from the index into a regular column
pclass_summary = pclass_summary.rename_axis('Pclass').reset_index()
# Final Summary
pclass_summary
sns.catplot(x="Pclass", y="Count", hue="Survived", data=pclass_summary,
height=6, kind="bar", palette = ['#FF999A','#CCCCCC'])
plt.title("Counts of passengers that did and did not survive by class of ticket")
plt.show()
Like the cabin variable, we need to encode the categorical variable "sex" into a numerical one for it to be accepted by sklearn's decision tree algorithm.
le = LabelEncoder()
le.fit(['male','female'])
for data in full_data:
    data['Sex'] = le.transform(data['Sex'])
learning.head(5)
Again, we encode the variable to work with sklearn. However, there are missing values in the 'Embarked' column, and we are going to replace them with 'S', the most common port of embarkation. This might not be the true port of embarkation; however, it seems like the most intuitive thing to do considering that it has the highest frequency amongst the three ports.
le_embark = LabelEncoder()
le_embark.fit(['S','C','Q'])
for data in full_data:
    data['Embarked'] = data['Embarked'].fillna('S')
    data['Embarked'] = le_embark.transform(data['Embarked'])
learning.head(5)
There are several ways to deal with missing values in a data set. You could drop the row/data point that has the missing value or, if a feature has too many missing values, drop the entire column. While dropping rows and columns seems like an easy fix, there's obviously a big downside to these strategies in that you lose viable data.
Other mathematically viable ways to impute the missing values with a sensible replacement are often considered better alternatives to dropping data points and features. In particular, single and multiple imputation methods are the two strategies that we could consider.
For the sake of this project, we will use single imputation with the mean.
imputer = Imputer()
# Learn the mean age from the learning set only, then reuse it for the test set
learning['Age'] = imputer.fit_transform(learning[['Age']]).ravel()
test['Age'] = imputer.transform(test[['Age']]).ravel()
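Single imputation with the mean boils down to very little code. The sketch below uses made-up ages and hypothetical helper names, and it assumes one common convention: the mean is computed on the learning data only and then reused for any other data set.

```python
import math

def fit_mean(values):
    """Mean of the observed (non-missing) values."""
    observed = [v for v in values if v is not None and not math.isnan(v)]
    return sum(observed) / len(observed)

def impute(values, fill):
    """Replace every missing value with the supplied fill value."""
    return [fill if v is None or math.isnan(v) else v for v in values]

# Made-up ages with gaps
learning_age = [22.0, 38.0, float("nan"), 35.0]
test_age = [float("nan"), 47.0]

learning_mean = fit_mean(learning_age)
print(impute(learning_age, learning_mean))
print(impute(test_age, learning_mean))
```

Mean imputation keeps every row, at the cost of shrinking the variance of the imputed feature, which is one reason multiple imputation is sometimes preferred.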
Let's create a base decision tree model to try and see what our accuracy is with the default set of features, to understand our starting point.
This base model could then be improved by using several techniques.
# Splitting the data into training and validation sets
X = learning.drop(['Survived','Name', 'PassengerId', 'Ticket'], axis = 1)
y = learning['Survived']
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size = 0.2, random_state = 42)
# Training the Base model
base_decision_tree = tree.DecisionTreeClassifier()
base_decision_tree.fit(X_train, y_train)
# Training score
base_decision_tree.score(X_train, y_train)
# Validation Score
base_decision_tree.score(X_valid, y_valid)
The base model has an accuracy score of 98.5% on the training set and a 78.77% accuracy on the validation set.
pd.DataFrame(data = base_decision_tree.feature_importances_,
index = list(X_train.columns),
columns = ['Importance']).sort_values(by = ['Importance'], axis = 0, ascending = False)
As postulated, we can see that Sex is one of the most important features when deciding on whether a passenger survived or not. While Age and Fare were not intuitively strong predictors - based on the feature disparity visualizations - the algorithm believes they are strong predictors.
Of the other variables in our consideration, passenger class and cabin seem to be moderately important to the algorithm.
It's important to note that these feature importances are produced using the current hyperparameters. Adjusting the hyperparameters may result in a different set of feature importances.
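For intuition on where these numbers come from: in sklearn, the importance of a feature is the total impurity decrease of the splits that use it, with each split weighted by the share of samples reaching that node, normalized so the importances sum to 1. The sketch below reproduces that bookkeeping for a hypothetical tree with two splits; all node statistics are invented for illustration.

```python
def node_importance(n_total, n_node, impurity, n_left, left_impurity, n_right, right_impurity):
    """Weighted impurity decrease contributed by a single split."""
    children = (n_left * left_impurity + n_right * right_impurity) / n_node
    return n_node / n_total * (impurity - children)

# Hypothetical tree: the root splits on Sex, one daughter then splits on Age
splits = [
    ("Sex", node_importance(100, 100, 0.48, 40, 0.30, 60, 0.20)),
    ("Age", node_importance(100, 40, 0.30, 25, 0.10, 15, 0.25)),
]

raw = {}
for feature, value in splits:
    raw[feature] = raw.get(feature, 0.0) + value

total = sum(raw.values())
importances = {feature: value / total for feature, value in raw.items()}
print(importances)
```

Since the splits themselves change with the hyperparameters, so do the importances derived from them.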
with open("tree1.dot", 'w') as f:
    # Write the tree in Graphviz .dot format to the open file handle
    tree.export_graphviz(base_decision_tree,
                         out_file=f,
                         impurity=True,
                         feature_names=list(X_train.columns),
                         class_names=['Died', 'Survived'],
                         rounded=True,
                         filled=True)
#Convert .dot to .png to allow display in web notebook
check_call(['dot','-Tpng','tree1.dot','-o','tree1.png'])
img = Image.open("./tree1.png")
draw = ImageDraw.Draw(img)
img.save('sample-out.png')
PImage("sample-out.png")
As we can see, this base model decision tree has a very large depth and is likely over-fitting, given the large discrepancy between the training and validation accuracy. Let's try to rectify this and improve our model.
Considering the ubiquity of the Titanic data-set, extensive work has been done on it, and we are going to take inspiration from some of the work that's out there, in particular from Sina, Anisotropic, Diego_Milia and Megan Risdal.
learning.head()
for data in full_data:
    data['Family_size'] = data['SibSp'] + data['Parch'] + 1
    # 1 if the passenger has no family on board, 0 otherwise
    data['IsAlone'] = (data['Family_size'] == 1).astype(int)
test.head(5)
sns.distplot(learning[learning['Survived'] == 1][['Family_size']].dropna(), bins =10,
kde = False, color = 'red', label = 'Survived')
sns.distplot(learning[learning['Survived'] == 0][['Family_size']].dropna(), bins =10,
kde = False, color = 'grey', label = 'Did not Survive')
plt.xlabel('Family Size')
plt.ylabel('Frequency')
plt.title('Number of passengers that survived based on the size of family on-board')
plt.legend()
plt.show()
Let's drop the smaller family sizes and gain some more resolution on the larger family sizes.
sns.distplot(learning[(learning['Survived'] == 1) & (learning['Family_size'] >=3)][['Family_size']].dropna(), bins =8,
kde = False, color = 'red', label = 'Survived')
sns.distplot(learning[(learning['Survived'] == 0) & (learning['Family_size'] >=3)][['Family_size']].dropna(), bins =8,
kde = False, color = 'grey', label = 'Did not Survive')
plt.xlabel('Family Size')
plt.ylabel('Frequency')
plt.title('Number of passengers that survived based on the size of family on-board')
plt.legend()
plt.show()
The above visualization shows that the larger a person's family, the less likely they are to survive.
summary_isalone = pd.DataFrame(learning.groupby(['Survived','IsAlone']).size(), columns= ['Count'])
summary_isalone.reset_index(level=0, inplace=True)
summary_isalone.reset_index(level=0, inplace=True)
summary_isalone
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(15,6))
my_pal = {1: "red", 0: "grey"}
# barplot is the axes-level function; recent seaborn versions ignore ax= in catplot
sns.barplot(ax = ax, data= summary_isalone, x="IsAlone", y="Count", hue="Survived",
            palette=my_pal, alpha = 0.5)
plt.show()
As we can see from the name column, there are titles such as Mr. and Mrs. embedded within the name. Let's extract this information and form a title variable. Using a feature such as the title allows us to extract a little more information about each passenger. For instance, a Master or a Sir would have a higher chance of survival due to their elevated social status.
Title: The salutation in the name of the passenger
# Raw strings keep the backslash in the pattern from being read as a string escape
re_res = re.search(r'([A-Za-z]+)\.', learning['Name'][0])
re_res.group(1)
learning['Name'][0]
def extract_title(name):
    title_search = re.search(r'([A-Za-z]+)\.', name)
    # If the title exists, extract and return it.
    if title_search:
        return title_search.group(1)
    return ""
for data in full_data:
    data['Title'] = data['Name'].apply(extract_title)
test.head()
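To see what the pattern picks up, here is a standalone sketch of the same idea run on a few sample strings. The strings are illustrative only and follow the "Surname, Title. Given names" layout of the Name column; `TITLE_PATTERN` is a name introduced just for this example.

```python
import re

# One or more letters immediately followed by a literal dot
TITLE_PATTERN = re.compile(r"([A-Za-z]+)\.")

def extract_title(name):
    match = TITLE_PATTERN.search(name)
    return match.group(1) if match else ""

names = ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina", "No title here"]
print([extract_title(name) for name in names])  # ['Mr', 'Miss', '']
```

Note that the pattern returns the first word that is followed by a dot, so a name with an abbreviated first part before the title would need a stricter pattern.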
summary_title = pd.DataFrame(learning.groupby(['Survived','Title']).size(), columns= ['Count'])
summary_title.reset_index(level=0, inplace=True)
summary_title.reset_index(level=0, inplace=True)
summary_title
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(15,6))
my_pal = {1: "red", 0: "grey"}
# barplot is the axes-level function; recent seaborn versions ignore ax= in catplot
sns.barplot(ax = ax, data= summary_title, x="Title", y="Count", hue="Survived",
            palette=my_pal, alpha = 0.5)
plt.show()
It appears that the Miss and Mrs Titles have a higher likelihood of survival, while the Mr Title has a lower likelihood of survival.
Let's zoom into the graph a little more by dropping the most common titles.
summary_title_reduced = summary_title[(summary_title['Title'] != 'Master') &
(summary_title['Title'] != 'Miss') &
(summary_title['Title'] != 'Mr') &
(summary_title['Title'] != 'Mrs') ]
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(15,6))
# barplot is the axes-level function; recent seaborn versions ignore ax= in catplot
sns.barplot(ax = ax, data= summary_title_reduced, x="Title", y="Count", hue="Survived",
            palette=my_pal, alpha = 0.5)
plt.show()
Interestingly, none of the passengers with a title of 'Rev' survived, while all of the passengers with a title of 'Countess', 'Lady', 'Mlle', 'Mme', 'Ms', and 'Sir' survived. This makes title quite a valuable predictor.
Finally, let's label encode our title variable
le = LabelEncoder()
le.fit(np.unique(np.concatenate((test.Title.unique(), learning.Title.unique()))))
for data in full_data:
    data['Title'] = le.transform(data['Title'])
learning.head(5)
test.head()
Family_size: Given that we have information about the parents/children and siblings/spouse of each passenger, let's create a family size feature which constitutes the total number of family members on-board the Titanic for each passenger, including themselves
IsAlone: A categorical variable that is set to 1 if Family_size == 1, and 0 otherwise
Title: The salutation in the name of the passenger, such as Mr, Ms, Mrs, Countess
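The first two engineered features are simple arithmetic on existing columns. As a library-free sketch, with passenger records represented as plain dictionaries and invented values:

```python
def add_family_features(passenger):
    """Return a copy of the record with Family_size and IsAlone added."""
    family_size = passenger["SibSp"] + passenger["Parch"] + 1  # +1 counts the passenger
    return {**passenger, "Family_size": family_size, "IsAlone": int(family_size == 1)}

# Made-up passenger records
passengers = [
    {"SibSp": 1, "Parch": 0},
    {"SibSp": 0, "Parch": 0},
    {"SibSp": 3, "Parch": 2},
]
enriched = [add_family_features(p) for p in passengers]
for row in enriched:
    print(row)
```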
Let's summarize the insights we gained through our engineered features
Family_size: Passengers with larger family sizes seemed less likely to survive
IsAlone: Passengers travelling alone also seemed to have a higher mortality rate
Title: Lastly, the title of the passenger was a strong indicator of their survival. The titles of Miss and Mrs had a higher likelihood of survival, while the Mr title had a lower likelihood of survival. Additionally, none of the passengers with a title of 'Rev' survived, while all of the passengers with a title of 'Countess', 'Lady', 'Mlle', 'Mme', 'Ms', and 'Sir' survived. This makes title quite a valuable predictor.
Drop Sex: Since most of the information we have about sex would be implied in the Title of the passenger, we can drop sex to avoid redundancy in our features.
learning = learning.drop('Sex', axis = 1)
test = test.drop('Sex', axis = 1)
test.head()
There are four strategies that can be used to optimize the hyperparameters of an algorithm, all of which require a validation set.
X = learning.drop(['Survived','Name', 'PassengerId', 'Ticket'], axis = 1)
y = learning[['Survived']]
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size = 0.2, random_state = 42)
test_depths = range(1,10)
test_depths
training_scores = []
validation_scores = []
for depth in test_depths:
    updated_decision_tree = tree.DecisionTreeClassifier(max_depth = depth, class_weight='balanced', random_state=42)
    updated_decision_tree.fit(X_train, y_train)
    training_scores.append(updated_decision_tree.score(X_train, y_train))
    validation_scores.append(updated_decision_tree.score(X_valid, y_valid))
print(validation_scores)
plt.plot(test_depths,training_scores, color = 'blue')
plt.plot(test_depths,validation_scores, color = 'green')
plt.title("Accuracy Scores vs max_depth")
plt.xlabel("Max Depth of Tree")
plt.ylabel("Accuracy Scores")
plt.legend(labels = ['training','validation'])
plt.show()
random_min_sample_splits = np.random.randint(low=2, high=60, size=30)
random_min_sample_splits.sort()
random_min_sample_splits = np.unique(random_min_sample_splits)
training_scores = []
validation_scores = []
for split_size in random_min_sample_splits:
    updated_decision_tree = tree.DecisionTreeClassifier(min_samples_split = split_size, class_weight='balanced',
                                                        random_state=42)
    updated_decision_tree.fit(X_train, y_train)
    training_scores.append(updated_decision_tree.score(X_train, y_train))
    validation_scores.append(updated_decision_tree.score(X_valid, y_valid))
plt.plot(random_min_sample_splits,training_scores, color = 'blue')
plt.plot(random_min_sample_splits,validation_scores, color = 'green')
plt.title("Accuracy Scores vs min_samples_split")
plt.xlabel("Min. Sample Split at node")
plt.ylabel("Accuracy Scores")
plt.legend(labels = ['training','validation'])
plt.show()
training_scores = []
validation_scores = []
avg_cross_val_scores = []
for depth in test_depths:
    updated_decision_tree = tree.DecisionTreeClassifier(max_depth = depth, class_weight='balanced', random_state=42)
    updated_decision_tree.fit(X_train, y_train)
    training_scores.append(updated_decision_tree.score(X_train, y_train))
    validation_scores.append(updated_decision_tree.score(X_valid, y_valid))
    # Additionally compute the average 10-fold cross-validation score on the full learning set.
    avg_cross_val_scores.append(np.mean(cross_val_score(updated_decision_tree, X, y, cv=10)))
plt.plot(test_depths,training_scores, color = 'blue')
plt.plot(test_depths,validation_scores, color = 'green')
plt.plot(test_depths,avg_cross_val_scores, color = 'red')
plt.title("Accuracy Scores vs max_depth")
plt.xlabel("Max Depth of Tree")
plt.ylabel("Accuracy Scores")
plt.legend(labels = ['training','validation', 'avg_cross_validation'])
plt.show()
training_scores = []
validation_scores = []
avg_cross_val_scores = []
for split_size in random_min_sample_splits:
    updated_decision_tree = tree.DecisionTreeClassifier(min_samples_split = split_size,
                                                        class_weight='balanced', random_state=42)
    updated_decision_tree.fit(X_train, y_train)
    training_scores.append(updated_decision_tree.score(X_train, y_train))
    validation_scores.append(updated_decision_tree.score(X_valid, y_valid))
    # Additionally compute the average 10-fold cross-validation score on the full learning set.
    avg_cross_val_scores.append(np.mean(cross_val_score(updated_decision_tree, X, y, cv=10)))
plt.plot(random_min_sample_splits,training_scores, color = 'blue')
plt.plot(random_min_sample_splits,validation_scores, color = 'green')
plt.plot(random_min_sample_splits,avg_cross_val_scores, color = 'red')
plt.title("Accuracy Scores vs min_samples_split")
plt.xlabel("Min. Sample Split at node")
plt.ylabel("Accuracy Scores")
plt.legend(labels = ['training','validation', 'avg_cross_validation'])
plt.show()
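The red curve in these plots is the mean of ten fold-level scores. As a rough illustration of what that average measures, here is a minimal standard-library sketch of k-fold cross-validation. The helper names (`k_fold_indices`, `majority_class`), the toy labels, and the majority-class "model" are all hypothetical stand-ins chosen to keep the example small. Note also that, for classifiers, `cross_val_score` with an integer `cv` uses stratified folds, which this plain contiguous split does not reproduce.

```python
from statistics import mean

def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous validation folds (no shuffling)."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        # The first `extra` folds get one additional sample.
        size = base + (1 if i < extra else 0)
        valid = list(range(start, start + size))
        train = [j for j in range(n_samples) if j < start or j >= start + size]
        folds.append((train, valid))
        start += size
    return folds

def majority_class(labels):
    """Hypothetical stand-in for a model: always predict the most common label."""
    return max(set(labels), key=labels.count)

# Toy labels (illustrative only, not the Titanic data).
y_toy = [0, 0, 1, 0, 0, 1, 0, 0, 1, 0]

fold_scores = []
for train_idx, valid_idx in k_fold_indices(len(y_toy), k=5):
    # "Fit" on the training part, then score on the held-out fold.
    prediction = majority_class([y_toy[i] for i in train_idx])
    hits = sum(y_toy[i] == prediction for i in valid_idx)
    fold_scores.append(hits / len(valid_idx))

avg_score = mean(fold_scores)
print(fold_scores, avg_score)
```

Every sample is used for validation exactly once, so the averaged score depends less on which rows happened to land in a single hold-out split.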
In general, the avg_cross_validation accuracy score is a stronger indicator of the model's accuracy on unseen data, because it is averaged over ten different validation folds rather than measured on a single validation set. However, in both cases above, we tried to optimize a single hyperparameter at a time. Let's try to optimize a couple of them simultaneously.
keys = ['max_depth', 'min_samples_split', 'training', 'validation', 'avg_cross_validation']
test_dict = {key: [] for key in keys}
test_dict
# Evaluate every (max_depth, min_samples_split) pair and record its scores.
for depth in test_depths:
    for split_size in random_min_sample_splits:
        updated_decision_tree = tree.DecisionTreeClassifier(max_depth = depth, min_samples_split = split_size,
                                                            class_weight='balanced', random_state=42)
        updated_decision_tree.fit(X_train, y_train)
        test_dict['max_depth'].append(depth)
        test_dict['min_samples_split'].append(split_size)
        test_dict['training'].append(updated_decision_tree.score(X_train, y_train))
        test_dict['validation'].append(updated_decision_tree.score(X_valid, y_valid))
        test_dict['avg_cross_validation'].append(np.mean(cross_val_score(updated_decision_tree, X, y, cv=10)))
scores_pd = pd.DataFrame(test_dict)
scores_pd.head()
# One panel per max_depth, with the three scores plotted against min_samples_split.
g = sns.FacetGrid(scores_pd, col='max_depth', col_wrap=3)
g.map(sns.lineplot, "min_samples_split", "training", alpha=.7 , color = 'blue')
g.map(sns.lineplot, "min_samples_split", "validation", alpha=.7 , color = 'green')
g.map(sns.lineplot, "min_samples_split", "avg_cross_validation", alpha=.7 , color = 'red')
plt.legend(labels = ['training','validation', 'avg_cross_validation'])
plt.show()
It seems that a max_depth of 5 with a small min_samples_split works best, both in terms of accuracy and in balancing the bias-variance trade-off.
scores_pd[(scores_pd['max_depth'] == 5) & (scores_pd['min_samples_split'] == 6)]
In essence, this was the manual process of doing grid search cross-validation. As explained above, with GridSearchCV you give the algorithm a set of values for each hyperparameter. The algorithm then takes the cartesian product of these sets, tries every combination of hyperparameters, and returns the combination that yields the highest cross-validated accuracy.
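To make the cartesian product concrete, here is a small standard-library sketch that enumerates every combination for the two hyperparameters tuned in this post and counts how many model fits a 10-fold search over them implies. It only illustrates the enumeration and is not scikit-learn's own implementation. The variable names are chosen for this example.

```python
from itertools import product

# Candidate values for the two hyperparameters tuned in this post.
max_depths = [1, 2, 3, 4, 5, 6, 7, 8, 9]
min_samples_splits = [6, 10, 11, 13, 14, 15, 17, 18, 19, 20, 21, 26, 27, 33,
                      38, 43, 44, 45, 48, 49, 50, 51, 56, 59]

# Cartesian product: every (max_depth, min_samples_split) pair.
combinations = [
    {"max_depth": depth, "min_samples_split": split}
    for depth, split in product(max_depths, min_samples_splits)
]

n_folds = 10
print(len(combinations))            # number of candidate settings
print(len(combinations) * n_folds)  # model fits needed for 10-fold cross-validation
```

The number of combinations grows multiplicatively with each added hyperparameter, which is the main cost of an exhaustive grid search.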
parameters = {'max_depth':(1,2,3,4,5,6,7,8,9),
'min_samples_split':[6,10,11,13,14,15,17,18,19,20,21,26,27,33,38,43,44,45,48,49,50,51,56,59]}
gridCV_decision_tree = tree.DecisionTreeClassifier(class_weight='balanced', random_state=42)
clf = GridSearchCV(gridCV_decision_tree, parameters, cv=10)
clf.fit(X, y)
clf.best_params_
clf.best_score_
As can be seen, given the same set of values for the hyperparameters, GridSearchCV selected a max_depth of 5 and a min_samples_split of 6, just as we did with our simultaneous k-fold cross-validation on the two hyperparameters.
Like GridSearchCV, RandomizedSearchCV also takes a dictionary of hyperparameter values to search over. However, unlike the exhaustive search of GridSearchCV, which checks every combination of hyperparameter values, RandomizedSearchCV randomly selects combinations to try. The number of combinations tried can be set using the n_iter parameter when initializing the RandomizedSearchCV object.
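As a rough illustration of the difference, the sketch below draws 20 combinations at random from the same search space instead of visiting all of them. It uses only the standard library and an arbitrary seed, so it will not reproduce the exact combinations RandomizedSearchCV picks. It samples without replacement, which is also what scikit-learn does when every hyperparameter is given as a list.

```python
import random
from itertools import product

# Candidate values for the two hyperparameters tuned in this post.
max_depths = list(range(1, 10))
min_samples_splits = [6, 10, 11, 13, 14, 15, 17, 18, 19, 20, 21, 26, 27, 33,
                      38, 43, 44, 45, 48, 49, 50, 51, 56, 59]

# The full grid that an exhaustive search would visit.
all_combinations = list(product(max_depths, min_samples_splits))

# Draw n_iter distinct combinations; the seed is arbitrary for this sketch.
rng = random.Random(42)
n_iter = 20
sampled = rng.sample(all_combinations, n_iter)

print(len(sampled), "of", len(all_combinations), "combinations would be tried")
```

With n_iter well below the size of the grid, the search is much cheaper, at the price of possibly missing the best combination.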
parameters = {'max_depth':(1,2,3,4,5,6,7,8,9),
'min_samples_split':[6,10,11,13,14,15,17,18,19,20,21,26,27,33,38,43,44,45,48,49,50,51,56,59]}
randomCV_decision_tree = tree.DecisionTreeClassifier(class_weight='balanced', random_state=42)
random_search = RandomizedSearchCV(randomCV_decision_tree, param_distributions=parameters,
                                   n_iter=20, cv=10, random_state = 42)
random_search.fit(X, y)
random_search.best_params_
random_search.best_score_
g_sub = pd.read_csv('./data/Titanic/gender_submission.csv')
test_w_response = pd.merge(test, g_sub, on='PassengerId', how='inner')
test_w_response.head()
test_x = test_w_response.drop(['PassengerId', 'Name', 'Survived', 'Ticket'], axis = 1)
test_y = test_w_response[['Survived']]
# Fill the one missing Fare with the mean fare (avoids chained assignment on a column)
test_x['Fare'] = test_x['Fare'].fillna(test_x['Fare'].mean())
final_model = tree.DecisionTreeClassifier(max_depth = 5, min_samples_split = 6,
class_weight='balanced', random_state=42)
final_model.fit(X,y)
test_predict_y = final_model.predict(test_x)
accuracy_score(test_y, test_predict_y)
with open("tree_final.dot", 'w') as f:
    # export_graphviz writes to the open file handle, so its return value is not needed
    tree.export_graphviz(final_model,
                         out_file=f,
                         max_depth = 5,
                         impurity = True,
                         feature_names = list(X.columns),
                         class_names = ['Died', 'Survived'],
                         rounded = True,
                         filled= True )
#Convert .dot to .png to allow display in web notebook
check_call(['dot','-Tpng','tree_final.dot','-o','tree_final.png'])
img_final = Image.open("./tree_final.png")
draw = ImageDraw.Draw(img_final)
img_final.save('final-out.png')
PImage("final-out.png")